In artificial intelligence, Thompson sampling,〔 named after William R. Thompson, is a heuristic for choosing actions that addresses the exploration-exploitation dilemma in the multi-armed bandit problem. It consists in choosing the action that maximizes the expected reward with respect to a randomly drawn belief. == Description == Consider a set of contexts , a set of actions , and rewards in . In each round, the player obtains a context , plays an action and receives a reward following a distribution that depends on the context and the issued action. The aim of the player is to play actions such as to maximize the cumulative rewards. The elements of Thompson sampling are as follows: # a likelihood function ; # a set of parameters of the distribution of r; # a prior distribution on these parameters; # past observations triplets ; # a posterior distribution , where is the likelihood function. Thompson sampling consists in playing the action according to the probability that it maximizes the expected reward, i.e. : where is the indicator function. In practice, the rule is implemented by sampling, in each round, a parameter from the posterior , and choosing the action that maximizes , i.e. the expected reward given the parameter, the action and the current context. Conceptually, this means that the player instantiates his beliefs randomly in each round, and then he acts optimally according to them. 抄文引用元・出典: フリー百科事典『 ウィキペディア(Wikipedia)』 ■ウィキペディアで「Thompson sampling」の詳細全文を読む スポンサード リンク